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Abstract 


We herein introduce a general procedure to capture the relevant information from a functional data set in relation to 
a statistical method used to analyze the data, such as, classification, regression or principal components. The aim is 
to identify a small subset of functions that can “better explain” the model, highlighting its most important features. 
We obtain consistency results for our proposals. The computational aspects are analyzed, an heuristic stochastic 
algorithm is introduced and real data sets are studied. 
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1 Introduction 


m 

(^"‘During the past years technological advances made possible the processing and storing of real time information. 
^Consequently, functional data arose across several fields, such as, finance, meteorology, medicine, among others. 

H ence, to handle these data new statistical procedures began to be developed. Classical statistical tools had to be 
|~T Rethought under this framework from a theoretical and empirical approach. In addition, new problems concerning the 
fc^iature of the data had to be tackled. There is a vast literature devoted to these topics, the most classical contributions 
^ire the books by Ramsay and Silverman (133, M) and Ferraty and Vieu m- More recently, the books by Ferraty 
+-And Romain m and Horvath and Kokoszka [23] establish the state of the art on functional data analysis, reviewing 

+J}he advances towards principal components analysis, regression and classification, among other problems. 

C/2 

The problem of feature extraction for multivariate problems has extensively been studied. The methods that show 
^y^etter results usually assess the importance of each variable based on the role that it plays in the statistical procedure 
^eing followed (classification, regression, principal components, etc). The list of strategies available to handle these 
C^roblems is large, however it is worth to mention that there are at least two proposals that have been studied through 
C^everal multivariate procedures, namely the Bayesian model averaging (BMA) and the “least absolute shrinkage and 
selection operator” (LASSO). The former approach are Bayesian proposals introduced by Frayley and Raftery ( [15] 
f—yid p~7]). Therein they analyze the problem of unsupervised classification. Hoeting et al. [24] , extent those ideas to 
the linear model. The LASSO was proposed by Tibshirani [32], who stated that “It shrinks some coefficients and sets 
^—others to 0, and hence tries to retain the good features of both subset selection and ridge regression...The LASSO 
ly^iinimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a 
^^onstant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence 
Lgives interpretable models”. More recently Witten and Tibshirani ([44], [43]) applied those principles to cluster and 
■ r^rincipal components. A detailed description of this method can be found in 1221 . 

The scenario is completely different in the functional data framework where there is much less literature available. 
^Several authors address the problem of variable selection in the regression model. James et al [23] extend the ideas 
of variable selection for regression studying the regression model when the predictor is functional and the response is 
scalar. Their goal is to estimate the regression function smoothly and sparsely, controlling its derivative and extending 
the ideas of the LASSO estimate. They claim that regions where the regression function is not null correspond to 
places where there is a relationship between the predictor and response variables, alternatively when it vanishes there 
is no relationship. They divide the time period into a grid of points. Then they use variable selection methods to 
determine whether the dth derivative of the regression function is zero or not at each time point. They implicitly assume 
that the dth derivative will be sparse. Lee et al. [29] also tackle this problem. They consider a sparse functional 
linear regression model which is generated by a finite number of basis functions in an expansion of the coefficient 
function. They adopt the idea of variable selection in the linear regression setting adding a weighted Li penalty to 
the traditional least squares criterion. Zhou et al 1461 introduce a functional linear model with zero-value coefficient 
function at sub-regions extending the SCAD ideas to this framework. In the three references above mentioned, the 
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procedures described are suitable under different notions of sparsity, an assumption that may be restrictive. Cuevas 
[8, highlights that functional data can be discretized where standard variable selection for high dimensional data 
could be applied. Aneiros and Vieu [3J study the problem of variable selection in the regression setting, when the 
covariates are functionals. They highlight that even though, due to the practical treatment of the data that involve 
discretization, these problems are usually treated as classical high dimensional problems, where p » n, hence ordinal 
variable selection procedures that are suitable for this setting are applied. However, they stress the importance of 
developing new variable selection techniques that take profit of the continuous nature of the variables. Along similar 
lines, Aneiros et al [2] consider the problem of variable selection for high dimensional data in partial least squares, in 
a very general setup. In particular, since they work on an abstract setting they allow to incorporate functional data 
to the model. Gertheiss et al m also consider the problem of variable selection in regression with scalar response and 
functional covariates, they propose a penalized likelihood approach that combines selection of the functional predictors 
and estimation of the smooth effects for the chosen subset of predictors. McKeague and Sen [35] and Kneip et al [26] . 
deal with the problems of functional regression models assuming the existence of an unknown number of points on 
the functional covariates that have special impact on the response. In [5] there are several recent results related to 
these topics. For instance, Aneiros and Vieu 2] introduce a variable selection procedure for the case of linear and 
partial linear regression when some covariates are functional, they take into account the natural specifications of the 
functional variables. Also, Abramowicz et al jjlj present a procedure for performing an ANOVA test for Functional 
Data that as a byproduct selects intervals where the groups significantly differ. The problem of variable selection for 
logistic regression has been studied by Matsui phi] . 

The problem of variable selection in classification has also been studied. Tian and James m introduce an inter¬ 
pretable dimension reduction procedure to classify functional data. On a first stage, they reduce the dimension of the 
data considering projections, then they proceed to the classification step. More precisely, let {X(t), t £ [a,5]} be a 
stochastic process and Y a categorical response variable, without lost of generality Y = {0,1}. Let {/i,..., fd} be a set 
of functions in [a, b ], and denote Zj the projections of X(t) on direction fj, i.e. Zj = X(t)fj(t)dt con j = 1,..., d. 
They perform the classification procedure in the low dimensional space. Hence, their aim is to retain a subset of func¬ 
tions that minimizes the misclassification rate. The key step of the proposal relays on the choice of the set of functions 
{/i,..., fd} , which are selected from the set of constant and linear functions defined on subintervals of [a, b\. They 
also tackle the computational problem describing a competitive stochastic algorithm for choosing fj for j = 1 ,,d. 
Also, Martin-Barragan et al S35] present a support vector machine based classifier for functional data that has good 
classification ability and it is easy to interpret. Fauvel et al m present a nonlinear parsimonious feature selection 
algorithm for the classification of hyperspectral images and the selection of spectral variables. 

Our approach is different, it is designed to be used following a “satisfactory” analysis of the data, by which we 
mean that the data have previously been subjected to classification, principal components or some other form of 
statistical analysis. Additionally, we have a set of functions, A = {/i,..., f p } , that provides relevant features of the 
data. Our aim is to choose a subset of A that retains almost all the relevant information of the data set related 
to the statistical procedure that is being studied. We extend the ideas introduced by Fraiman et al m for several 
multivariate procedures. They assume that a statistical analysis of a multivariate data has been successfully carried 
out, hence their idea is to blind unnecessary or redundant variables (by means of the conditional expectation), keeping 
the subsets of variables that better reproduce the output of the statistical procedure in the high dimensional space. 
Our procedure can be applied to a broad family of statistical problems. In this paper we extend the blinding procedure 
to the functional framework for the problems of classification, principal components and regression. 

The remainder of the paper is organized as follows. We introduce the main definitions in Section 2. In Section 3 
we study the classification problem. Section 4 is devoted to principal components analysis and Section 5 to functional 
regression. Finally, in Section 6 we consider some numerical aspects and analyze several well known data sets. Our 
concluding remarks are made on Section 7. All proofs are given in the Appendix. 


2 Main Definitions and notation 

We start by fixing some notation that will be used throughout the manuscript. Let X : Q, L 2 [a, b] be a random 
element {X =: X(t) : t £ [a, 6]} defined on a rich enough probability space (f l,A,P) where [a, b] C 1 is a finite 
interval. As it is well known, in practice, the data will be available on a grid a < si < ... < sn < b which will be also 
denoted by [a, b\. 

A population statistical model is given by the function i/j(P) := tf)(X,g), for f:I 2 [«,t]xG-> L 2 [a, 6], where G 
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is a separable metric space. The output of the statistical procedure is a random function in L 2 [a, b] given by ip(X,g). 
Let A be a set of known functions, A = {f ±,..., f p , fi : L 2 ([a, b\,P) —► R}, and we denote by 

f(X) =: f =: (/r, {/i(X),..., 

the random vector where each coordinate is the evaluation of the trajectory X on the function fj £ A. 

Some examples of interest are the following: 

1. Pointwise evaluation: fi(X) = X(t i),..., f p (X) = X(t, p ) for a < t\ < t 2 < ■ ■ ■ < t p < b. 

2. Local Averages: fi{X) = |dry f Ti X(t)dt ,..., f p (X) = j^-j- f T X(t)dt, where {T 1; ... ,T p } are disjoint intervals 
of the interval [a, b\. 

3. Occupation Measure: fi(X) = |{f : X(t) £ Ti}|,..., f p (X) = \{t : X(t) £ T p }|, where T\,...,T p are disjoint 
intervals (not necessarily bounded) in R and | • | stands for the Lebesgue measure. That is, the time the process 
spends at each interval, or above or below a barrier. 

4. Number of up crossings to a level: Let fi(X) = Ni , where TV, is the number of up crossings of X to a level 

Ci £ R, * = 1 ,p. Recall that the process X(t) is said to have an up crossing of c at to >0 if for some 

e > 0, X(t) < c if t £ (to — e, to) and X(t) > c if t £ (to, to + e). If the trajectories are differentiable, then 

Ni = card{t £ [a, b] : X(t) = Ci,X'(t ) > 0}. 

5. Moments of the norm of the process: fi(X) = E(\\X\\),..., f p (X) = L1(||X|| P ). 

6. Moments of the process: f\{X) = X(t,u>)dP(ui),... f p (X) = X p {t,u>)dP{ui). 

Remark 1. The choice of the set of functions {/i(A'),..., f p (X)}, depends on the nature of the process X(t) and 
on features of the statistical procedure under consideration that are relevant to achieve a better understanding of the 
practical problem. For instance, pointwise evaluation and local averages reveal information towards the domain of 
X(t ), to detect a subset of points or small intervals at the domain, which explain successfully the phenomena. A 
typical example are spectrometry studies. The occupation measure and the number of up crossings to a level, provide 
information concerning the image of X (t), these characteristics are relevant when measuring the pulse oximetry in 
preterm infants. Finally, the last two proposals should be used if the aim is to retain global information of X(t), these 
features are very important in trading financial problems. 

These sets of functions provide relevant information of a process towards the statistical procedure that is being 
conducted. Our aim is to retain a subset of A, of cardinality d « p that contains “almost” all the relevant information 
provided by f(A) to an specific statistical analysis (classification, regression, etc). To achieve this goal we extend the 
ideas introduced by Fraiman et al m- Therein, they proposed to retrieve relevant information from a multivariate 
model blinding unnecessary variables. It is clear that in the functional data framework a componentwise approach 
does not make sense, hence the blinding procedure must be redefined. 

Given a subset of indices I = {ii < ... < id} C {1,..., p} with d < p, consider the random vector f (I,X) := 
f(J) := (fi 1 (X),...,f id (X))£WL d . 

Definition 1. Let I be a subset of {1,... ,p} denote the stochastic blinded process of X, based on f(I ), to the 
process Z{I) : [a, b] —> R given by 

Z(I)(t) = E(X(t)\f(I)):=r 1 (t,f(I,X)). (1) 

We denote Q(I) to the distribution of Z(I). 

Remark 2. Z(I)(t) is an stochastic process even though it is the conditional expectation given a random vector 

f(!) e R d . 

Given a fixed integer d , 1 < d << p, we let Id be the family of all subsets of {1,... ,p} with cardinality smaller 
than or equal to d. 

We seek a small subset, J, such that if{X, g) is as close as possible to if(Z(I), g). The notion of closeness may vary 
from one problem to another, and is denoted by h(I,P,Q(I),i/>) := h(I). 


3 


More precisely, Iq C Id is defined as the family of subsets in which the minimum h(I) is attained for I £ Id, i.e., 

Zo = argmin Ie x d h(I). (2) 

In practice, we look for consistent estimates of the set Iq, Iq Q Iq based on a sample X \,..., X n of iid trajectories 
of the stochastic process X. 

Given a subset I £ Id , the first step is to obtain the blinded version of the sample, X 1 (I),... ,X „(/), based on 
f(J, X), using nonparametric estimates of the conditional expectation. 

We denote by Q n (I) to the empirical distribution of {Xj(I), 1 < j < n}. 

For instance, we may consider the r - nearest neighbor (r-NN) estimates. We fix an integer value r, the number of 
nearest neighbors used. For each j £ {1,... ,n}, we find the set of indices C 7 of the r nearest neighbors of f (I,Xj) 
among {f (I, X x ),..., f(J, X n )}. Next we define the predicted sample processes as 

X j (I)(t) = Ep(X j (t)\f(I)) = ^ J2 Xm(t) £ L 2 [ a ,b\. 

m£Cj 


Note that the set Cj does not depend on t. 

Then, given a subset of indices I £ Id, we define the empirical version of the objective function h n (I), and we seek 
the optimal subsets of variables Iq C Id, which are the family of subsets in which the minimum of h n (I ) is attained, 

i.e., 


I n = ar gmini £i d h n (I). (3) 

This will become clear through the next sections where we consider different statistical problems and models under 
this approach. 

In order to prove the consistency of ([3]) to ([2]) the following hypothesis must be satisfied. 


HI. Let X(I) be a strong consistent estimate of Z(I), i.e., Z(I) — X(I) 


L 2 [ a,b ] 


Liam et al [30| establish general conditions to obtain consistent estimates of the conditional expectation on a space 
provided with a pseudometric. Hence, HI is a particular case of Liam’s et al result, since we are dealing with variables 
in To be more precise, the conditions imposed by Liam et al can be relaxed, obtaining the following result: 

Proposition 1. Let Z(t) = X)) + e, Z(t) £ H, be a separable Hilbert space, f(I,X) £ R d and e £ H a 

random element with zero mean and independent o/f(J, X). Let r] n be the non-parametric estimate of the conditional 
expectation given by the r nearest neighbors . Given that, 

1. f(/, X) has density function g such that 0 < c\ < g{x) < C 2 for all x in the support o/f(J, X), 

2. (a) || rj(t, f (/, X))||^ < B for all f(/, X) £ R d , 

(b) \\r)(t, f(J, X)) — r]{t, f(J, X)')\\ n < M \\f(I, X) — i(I, Xy\\ 2 (Lipschitz condition), 

3. there exist 5 > 0 such that E (||e||^ i 1 ”' 5 ) < oo, 

4. k/n —> 0, k/logn —> oo, ( logn/k) S ^ 2 < oo, 

then 

||r? n (t,f(/,X)) - r]{t,f{I,X))\\ n = O + a ' s ’ 

Remark 3. Proposition [7] is a direct consequence of Theorem 1 from Liam et al J30j and also from the fact that, in 
K d , under condition 1 due to Lebesgue’s Differentiation Theorem, we have that P {B{x, h)) = O (/i d ). In what follows 
TL = L 2 [a, b]. 
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Remark 4. Kudrazow and Vieu I28 J established general uniform consistency results and convergence rates for r-NN 
generalized regression estimators, where the observed variable takes values in an abstract space. We could have also 
applied these results to achieve consistency. 

Once we have settled the general framework we proceed to define explicitly the theoretical and empirical objective 
functions for classification, principal components and regression. 


3 Supervised and Unsupervised Classification 

Let X(t) £ L 2 [a,b\ be a random process and I\ the number of groups. For supervised classification, as well as 
unsupervised classification (when the number of clusters K is known) we have a function g : A 2 [a, b] —> {1,..., A'} 
that determines to which group (or cluster) each trajectory belongs to. We denote the space partition by Gk = 
g^ 1 {k), k=l,...,K. 

For a fixed integer d < p, our aim is to find a set I C {1,... ,p}, j \I < d , where the population objective function, 
given by, 


K 

h(I) = l-^P(g(X) = k,g(Z(I)) = k), (4) 

k =1 

attains its minimum. Function Q measures the difference between the partition of the space considering the original 
and the blinded trajectory. 

Remark 5. It is clear that instead of minimizing Q) one could maximize h*(I) = J2k=i Ufi'PO = &><?(-^(-0) = k), 
which is the objective function defined by Fraiman et al m for the multivariate case. 

The empirical version of equation Q is given by 


K 






k—1 j=l 


(5) 


where the function g n : L 2 [a,b] —> |1,...,A'} is the empirical space partitioning function, i.e., it determines to 
which group (partition subset) each empirical process belongs. In this case the partition of the space is given by 
G'[ n) = g~ 1 {k), k = meaning that |s| measures the proportion of observations that are classified in 

different groups by the original and the blinded processes. 

To prove consistency results, in addition to HI, the following assumptions are requested. 


HC1. (a) The partition of the space is strongly consistent, i.e., given e > 0, there exists a set A(e ) C L 2 [a,b\, with 
P{X £ A(e)) > 1 - e such that, for all r > 0, sup xeC (e,r) \l{g n (x)=k} - I{ g {x)=k}\ a.s . 0 as n -» oo for 
k = 1,...,AT, where C(e,r ) = A(t) D lC r with IC r is an increasing sequence of compact sets, such that 
P(X £ IC r ) —> 1 when r —> oo. 

(b) d(X,dG JJ) — d(X,dGk) -^ as 0 when n —> oo, where dGk (respec. <9G£) is the boundary of Gk (respec. 

G n k ). 

HC2. P(d(Z(I),dGk) < S) —> 0 when 6 —> 0 for k = 1,..., K. 

HC3. The distribution is non-degenerated, i.e. for every <5 > 0, P(X £ B(x, d)) > 0 for almost every x £ L 2 [a , b\. 

Theorem 1. Let {Xj(t ) : t £ [a, fo]} be iid realizations of the stochastic process X(t). Given d, 1 < d < p, let Id be 
the family of all the subsets {1,... ,p} with cardinality smaller than or equal to d and let Iq C Id be the family of all 
the subsets for which the minimum of equation 0 is attained. Under HI, HC1, HC2 and HC3 we have that for 
each I n £ I n , there exists no(ui) such that, for every n > no(u>), with probability one I n £ Iq- 

The proof is given in the Appendix. 
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4 Principal Components 


Let {A'(£) : t £ [a, 6]} be a stochastic process with finite second moment ^ E (j|^ 2 || L 2 [ a , for almost every 

t £ [a, b], with continuous trajectories. Without loss of generality we assume that it has zero mean ( E(X(t )) = 0). We 
denote v{t,s) = E(X(t)X(s)) its covariance function. As in the finite-dimensional case, the covariance function has 
associated a linear operator, T : L 2 [a,b\ —> L 2 [a,b] defined as 


f b 

(Tu)(t) = / v(t, s)u(s)ds for all u £ L 2 [a, b\. 

J a 


( 6 ) 


We assume that IMIz, 2 [ 0 u x ^[ a b] = ll fa 1/2 (*> s)dtds < oo. Cauchy-Schwartz inequality implies that |ru| 2 < ||^|| 2 |ri| 2 , 
where |u| stands for the standard norm in the space L 2 [a , 6], while ||z/|| denotes the norm in the space L 2 ([a, b] x [a, b]). 
Therefore, T is a self-adjoint continuous linear operator. Moreover, T is a Hilbert-Schmidt operator. 

JF stands for the Hilbert space of such operators with inner product given by 


(I'd, 14)^ = £race(rir 2 ) = 


oo 


/_ J L1 Uj , r 2 Uj ) L2 [a,b]; 

i=1 


where {uj : j > 1} is any orthonormal basis of L 2 [a,b] and (u,v) denotes the ordinary inner product in L 2 [a, b\. 

If we consider the basis of the eigenfunctions of T, {<f>j : j > 1}, we have that ll r lljr = (r,r)jr = = 

J b J b v 2 (t, s )dtds < oo, where {A j : j > 1} are the corresponding eigenvalues of T. In what follows we assume that 
all eigenvalues are different. For any random variable, U, defined as a linear combination of the process {X(s)}, i.e. 
U = J b a(t)X(t)dt = ( a,X ), a £ L 2 [a, 6], we have that var(U) = E{U 2 ) = J b f b s)a(s)dsdt = (a,Ta). 

The first principal component is defined as the random variable U\ = (aq, A) i2 [ a b ], such that, 

Var(Ui) = sup Var ({a,X) L2[ b] ) = sup (a, Ta) L2 , b] , 
ll«llia [ . I » 1 =l V 7 lfolL2 Ki ,]=l 

and the k — th principal component as the variable 

Uk = {Oik, ^) l 2 [ o , 6 ] > (7) 

such that 

Var(U k ) = sup Var ((a, A) i2[a b] ) = sup (a, Ta) L2[a b] 


subject to \\a\\ L2[a b] = 1 and {a, aj) L2[a b] = 0 for j = 


= L 


, k — 1. 


Therefore, if A j > Aj+i, are the eigenvalues of T, Riesz Theorem [40] entails that the principal components are 
obtained from the corresponding eigenfunctions. Let {a k , k > 1} be the basis of eigenfunctions of the covariance linear 
operator T, then U k = (a k ,X) is the k — th principal component and Var{Uk) = \ k - 

Assuming that the first l < p principal components yield a good representation of the original process, for each 
/ £ Id, we define, t4(J) = (a k , Z(I)) L2 r Q . Our goal is to find a subset I such that t4(/) is as closest as possible to 
U k , for every k = 1 ,,1. The objective function is given by, 

i 

h(I) = ^E((U k -L 4(/)) 2 ). (8) 

fc =i 

This function measures the mean value of the square distance between the projection with the original trajectory and 
with the blinded one. Given d < p, our aim is to find the set I £ Id that minimizes the objective function ([8|. 

In order to give the empirical version of (|8|, we shall give the empirical counterpart of the principal components. 
Then, let v n (t,s) be the empirical covariance function, i.e., v n (t,s) = — Xj(t)Xj(s), and T„ its corresponding 

linear operator, given by 



f=i 


(9) 
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where ~V ;/ is the linear operator defined as ( VjU) (t) = f t ' Xj(t)Xj(s)u(s)ds. Hence, Fubini’s Theorem entails that, 
E (Vj) = r for all 1 < j < n. 

Then, the empirical version of function Q is 


l 1 n 


n 

k=l j =1 


( 10 ) 


where U J k = 

L 2 \a,b] and U° k {I) = ( a k , Xj (I)) . The weights of the k—th empirical principal component for 

L ’ J \ / L 2 [a,b] 

the random sample {X ±,..., X n } are a k . In addition, as a consequence of Riezs’s Theorem, we have that, {a k , k > 1} 
is the basis of eigenfunctions of T n associated to {A k ,k > 1}. 


Our aim is to find the set I & Id that minimizes (10). 


To obtain consistency results, in addition to HI, the following conditions are needed: 


H2. E (\\X - Z(I)\\ 2 L2[aM ) <oo. 

HP1 E (||X||^ 2[ab] ) < oo and M\ L 2 ([aMx[aM) = f* v 2 {t, s)dtds < oo. 

Theorem 2. Let { Xj[t) : t £ [a,&]} be a stochastic process satisfying 0. Given d, 1 < d < p, let Id be the family 
of all the subsets of {1,... ,p} with cardinality smaller than or equal to d and let lo C Id be the family of subsets in 
which the minimum of equation |$|j is obtained. Then, under HI, H2 and HP1 we have that for each I n £ I n , I n £ Iq 
ultimately, i.e. I n = Iq with Iq £ Iq for all n > no(cv), with probability one. 


The proof is given in the Appendix. 


5 The functional linear model 

In the functional linear model the response can be either scalar or functional, in this section we analyze both cases. 


5.1 Linear Model with Scalar Response 

Let Y £ R and X £ L 2 [a , b\, the linear model with scalar response is given by, 

Y= f @(t)X(t)dt + e, 

J a 


(ii) 


where (3 £ L 2 [a,b\ and e is a random variable such that E{e) = 0 , E(e 2 ) < oo and E(X(t)e) = 0, for almost every 
t £ [a, b\. 

Let f3o £ L 2 [a , b] such that 


A) = argminp eL z[ atb ]E 1(y~J f}(t)X(t)dt 


( 12 ) 


To ensure existence and uniqueness of /?q under this setting additional conditions are required. 

Assume that E ^||A'||^ 2 [ a b ^j < oo. Without loss of generality, we may assume that the stochastic process is 
centered, i.e., E(X(t)) = 0 for almost every t £ [a, b]. As a consequence of Fubini’s Theorem, is clear that E(Y) = 0, 
since E(X(t)) = 0 and E(e) = 0. 

Let T be the covariance operator of the random element X given in ©>■ 

Denote Im(T) to the closure of Im(F) = {Tiqu £ L 2 [a,b\} . Under mild regular conditions the existence and 


uniqueness of (12) in Im(T) is ensured. For a detailed explanation see Q2j, Chapter 2. 
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We define the objective function as, 


h(I) = E 



Po(t)X(t)dt - 


Po (t)Z{I)(t)dtj 


(13) 


It measures the mean square distance between the predicted value considering the original variables and the blinded 


ones. Given d < p, our goal is to find a subset I & Id that minimizes equation (13). 


In order to give the empirical counterpart of (13) we must provide an estimate of p. There are several ways to 


face this problem. One alternative is to estimate the covariance operator T. Then project the data onto a finite 
dimensional space that grows, as the sample size grow, typically by means of the principal components. Sometimes 
this method is combined with a smoothing procedure. Another alternative is to represent p in a basis of functions of 
L 2 [a, b] satisfying (12) adding a penalty to obtain a regular solution. These basis may not be orthonormal, for instance 
Fourier or Spline basis. 

As mentioned in the introduction this problem has been studied by several authors, among them we can mention 
Cardot et al [5], Cai and Hall [7j, Hall and Horowitz pH], Li and Hsing [3T], James et al [25]. We consider the estimator 
proposed by Cardot et al [B], which is a strong consistent estimate of j3. 


The empirical version of the objective function (13) is given by 


K(I) = - £ 


nb pb 

Pn(t)Xj(t)dt — / Pn(t)Xj(I)(t)dt 


1=1 




(14) 


where P n is the estimate of /3q. Our aim is to find the set I £ Id that minimizes (14). 


In order to obtain consistency results, in addition to HI and H2, the following hypothesis are needed: 


HR1. \\Pn — Po\\ L 2[ a ,b] — >a.s. 0. 

HR2 E{\\X\\l 2[a b] )< oo. 

It is important to note that the regression estimate, /3 n , introduced in j6j satisfies HR1. 


Theorem 3. Let {(Tp Xj(t)) £ K. x L 2 [a, 6]} be iid stochastic processes satisfying (11). Given d, 1 < d < p, let Id be 
the family of all the subsets of {1,... ,p} with cardinality smaller than or equal to d and let Iq C Id be the family of al 
the subsets for which the objective function (lS/\ ) attains its minimum. Under HI, H2, HR1 and HR2 we have that 
for each I n £l n , there exists no(ui) such that for each n > no(w), with probability one, I n £ Iq- 


The proof is given in the Appendix. 


5.2 Linear Model with Functional Response 

Let Y £ L 2 [c, d\ and X £ L 2 [a , b], the regression model with functional response is given by, 

y(s) = f p(t, s)X{t)dt + e(s) s£[c,d], (15) 

J a 

where /3 £ L 2 ([a, b] x [c, d]) and e(s) is a random variable for each s, such that E (e(s)) = 0 and E {X(t)e(s)) = 0, for 
almost every t £ [a, b ], s £ [c, d]. 

Let po £ L 2 ([a, b\ x [c, d]) such that 

B 0 = arq min E 

0eL 2 ([a,b]x[c,d]) 

Under this setting existence and uniqueness of fio are n °t provided unless some additional hypothesis are established. 
Following the same ideas as in the case of scalar response these properties can be obtained. 













Without loss of generality, we may assume that E(X(t)) = 0 for almost every t £ [a, 6], which entails that 
E(Y(s)) = 0, for almost every s £ [c,d ], and also that E (||Aj|| 2 j a < oo and E ^||Y||| 2 j c < oo. Let Tx be the 
covariance operator of X. 

Then under mild conditions, /3o £ L 2 ([a, b] x [c,d]) and also the existence and uniqueness of the solution in the 
orthogonal space of the kernel of the covariance operator of X are obtained. See [12], Chapter 2. 

Once more, following the same ideas as in the case of scalar response, the objective function is given by, 

2 


h(I) = E 


Po{t,s)X(t)dt- / p 0 (t,s)Z(I)(t)dt 


(16) 


L 2 [c,d]y 


This function measures the mean square distance between the predicted functions considering the original process 
and the blinded one. Then, given d < p, we search for a subset I £ Id that minimizes the objective function (161. 


To give the empirical version of (161, we need an estimator of j3, this problem has been studied by several authors, 
for instance, Yao et al [IS] and Muller and Yao [36]. Yao et al, introduce the following estimate 

Pn{t,s) = 2^2^— -^--- 


j '=1 3 = 1 




where ct"-, is an estimate of E They show that (3 n converges in probability to /3 0 . Hence, the empirical 

version of (16) is 

r b r b 


hn ^) - _ X! 


j =i 


Pn(t, s)Xj(t)dt — / Pn(t, s)Xj(I)(t)dt 


(17) 


L 2 [c,d] 


Our goal is to find a subset I £ Id, that minimizes ( |T7| ). To obtain convergence results, in addition to HI and H2, 
we need the following conditions: 

HRF1. \\p n - Po\\ L 2 ([a,b]x[c,d]) ~^P- 0 . 


HRF2. E 


<ooyF(||Y| 


2 

L 2 [c,d] 


< OO. 


It is important to remark that the estimate /3 n , introduced by Yao et al 
Then we have the following consistency result. 


satisfies HRF1. 


Theorem 4. Let {(Yj(s), Xj(t)) £ L 2 [c, d] x L 2 [a, 6]} be iid stochastic processes satisfying (15). Given d, 1 < d < p, 
let Id be the family of all the subsets of {1,... ,p} with cardinality smaller than or equal to d and let I o C Id be the 
family of all the subsets where the minimum of the function (16) is attained. Under HI, H2, HRF1 and HRF2 
given I n £ I n it can be shown that P (/„ £ Iq) —> 1. 


The proof is given in the Appendix. 

Remark 6. If (3 n is a strong consistent estimate of do, then replacing hypothesis HRF1 by 


HRF1*. \\Pn-Po\\ LH[aMx[c „ 


d]) 


0 


strong consistency is attained in Theorem [2] 


6 Numerical Aspects 

This section has two main parts. First we introduce a stochastic algorithm to find the optimal subset of functions. 
Then we illustrate the performance of our proposal analyzing some well known data sets. 
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6.1 The algorithm 


As mentioned in Section [2] our aim is to select, Id, a subset of A = {/i,..., f p }, that better captures the information 
of a stochastic process relative to a statistical procedure, i.e., h n (Id) must be “small”. 

Before describing the algorithm we shall discuss when to consider that h n (Id ) is small enough. For the case of 
classification since h n (Id) is the proportion of observations classified into different groups when we use the original or 
the blinded trajectories, i.e. the “matching error rate”, it is clear how to establish that h n (Id ) is small enough. However 
for the cases of regression and PCA h n (Id ) depends on the units in which the functions are measured (regression and 
PCA). Therefore, our proposal is to rescale the objective function, h n , to make it independent of the units of the data. 
Then, a subset of functions will be chosen if 

h n {Id) < e, (18) 

where e is a positive constant that must be supplied by the user. 

We exhibit the rescaled objective function for each of the problems analyzed in this work. 

For the case of principal components we suggest to rescale the objective function as follows, 


hn(drf) — 


hnifid) 


eLTe?„ (O' 


where h n (Id) is given by equation (10). In an analogue way, the rescaled objective function for the linear functional 

hnijd) 


model with scalar response is given by, 


hn(Id) — 


IE;., (CMm 


2 ’ 


where h n (Id) is given by equation (14). Finally, we consider the linear regression problem when the response is 
functional the rescaled objective function is, 




hn(Id) 


SE"=1 la Pn(t,s)Xj(t)dt 


L 2 [c,d] 


where h n (Id ) is defined by equation ©• In the last three cases the rescaled objective functions measure the relative 
difference between the procedure carried out with the blinded processes and with the original processes. 

Remark 7. For classification h n (Id) = h n {Id), and e is the maximum matching error rate allowed by the user. 

It is clear that if p is moderate or large it is not feasible to analyze the 2 P — 1 subsets of functions, hence we need 
to provide a numerical strategy to tackle this problem. 

We introduce a forward algorithm that has two main steps. On the first one, an exhaustive search over all the 
subsets with cardinality d\ (where di « d) is performed. While the second step is a forward stochastic search. 

Exhaustive Step 

Conside£ allAhe subsets of functions with cardinality smaller than or equal to d\. Keep the Nq subsets with 
smaller h n (Ifi), where Ir = I\, ..., I^ 0 . 


If there is at least one of them satisfying condition (18), stop the algorithm. 


Among all the subset satisfying condition (18) retain the subset with smallest cardinality, in case 
of ties, keep the subset with smallest h n . 


Otherwise, if condition (18) is not satisfied, proceed with the Stochastic Step, where the input are the 


subsets of functions given by I ±,..., Jjv 0 . 

Stochastic Step 


Repeat until condition (18) is attained. 


For each subset Ij, where d = 1,... ,N 0 . 

Choose at random without replacement Ni, ii ,..., i^ 1 functions from I\Ii- 
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Consider the subsets 7^ . =: for j = 1, AT. 

Compute h n (ld,j) for j = 1,... ,Ni- 

Retain the iVo, subsets with smaller h n , with a slight abuse of notation denote them 7i,... ,In 0 - 


If there is at least one of them satisfying condition (18), stop the algorithm. Among all the subset 
satisfying condition (18) retain the subset with smallest cardinality, in case of ties, keep the subset 
with smallest h n . 


Otherwise, if condition (18) is not satisfied proceed with a revision step. 

For each subset 7j, where d = 1,..., N 0 . 

Replace at random, one at a time, each element of 7j, if h n decreases keep the best subset. 


If condition (18) is attained stop the algorithm. 

Otherwise, run the Stochastic Step from the top. 

The algorithm depends on several parameters, namely, e, d -\, Nq and Ni. We have already discussed the roll that e 
plays. The other three parameters can be easily settled and are all in relation with the cardinality A. d\ denotes is 
the maximum cardinality analyzed in the exhaustive step. This number should be small, specially if the cardinality of 
A is big. N 0 is the number of subsets retained after the exhaustive step and is the number of variables added to 
each potential subset chosen either in the exhaustive step or once the stochastic step has been run without achieving 
condition (18), this constants can be moderate, allowing several bifurcations we seek to avoid local minima. 


6.2 Real Data Example 

In this Section we will analyze the behavior of our procedure on several real data examples. Routines are available 
upon request to the authors. 

The growth data set 

The growth data set has been analyzed in several opportunities, and corresponds to the Berkeley growth study. It is 
available in the functional data analysis packages for Matlab and R. The data set consists on the height measurements 
of 54 girls and 39 boys, that were measured on 31 opportunities from 1 to 18 years. The original data was smoothed 
by monotonic cubic regression spline, the curves are shown in Figure [I] (a). 

Our aim is to classify the growth curves for a sample of boys and girls. Since there is no learning and test sample 
we shall proceed following the ideas from Lopez-Pintado and Romo [52] . Therein, they suggest to separate the data 
into four groups of the same size and consider one of them as test group. The remaining observations constitute the 
learning set. The data set is classified by fitting a functional generalized additive model (see [lO]) and the number of 
misclassified observation is computed from the test group. All these steps are repeated changing the test group 4 times 
and the total error rate is defined as the number of misclassified observations over the total number of observations. 
It is clear that different partitions yield different misclassification rates, to avoid this effect we repeat this procedure 
50 times. The misclassification rate is between 3.23% and 10.75%, the mean is 6.17% and the median is 6.45%, which 
are very good results in comparison to those obtained by Lopez-Pintado and Romo [321 . 

Our goal is to find which are the key time instants that determine that boys and girls have different growth patterns. 
Then, if the set A is conformed by the evaluations on the grid of time points, we apply the blinding procedure to 
these functions. The cardinality of A is 31, hence it is feasible to perform an exhaustive search. The nonparametric 
estimation is done by r-nearest neighbors, with 7' = 3 or 5. If d\ = 1, the mean matching error is high 21.94% (respc. 
22.09%) for r = 3 (respc. r = 5.) Then we decided to set d\ = 2, this means to look for the pair of functions with 
minimum matching error that belongs to A. In these cases the mean matching error decreases to 4.75% (respc. 4.77%) 
for r = 3 (respc. r = 5.) In Figure [l] (b) (respc. (c)) we show the histogram for the instants chosen by the blinding 
procedure, indicating with different colors the first and second choice, in almost every case the final height is a relevant 
feature, boys tend to be taller than girls. For the other measure there are clearly three modes, the first one is the 
initial measurement (people tend to grow in the same percentile curve throughout their lives), the second one is close 
to 6 years (there is an acceleration in the growth speed) and the third one during the puberty (girls grow earlier than 
boys). Hence the time points chosen by the algorithm reflect important patterns on the growth charts. 

It is important to remark that histograms corresponding to r = 3 or r = 5 are very similar. 
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Figure 1: (a) Height curves, the darker ones are from the boys and the lighter ones for the girls, (b) and (c) The 
histograms show the ages chosen by the blinding procedure, the green ones represent the first time and the purple one 
the second one, for r = 3 or r = 5. 


The phoneme data set 

In the following example, the data corresponds to the discretized log-periodograms, which is a widely used method 
for casting speech data in a form suitable for speech recognition. The learning data set, as well as the test data sets, 
contain 250 recordings of men pronouncing five phonemes. Each curve is recorded on a grid of 150 equally spaced 
time points. The phonemes are “sh” as in “she”, “iy” as the vowel in “she”, “del” as in “dark”, “aa” as the vowel in 
“dark”, and “ao” as the first vowel in “water”. The label of each observation is known. 

The dataset is in the R package fda. use (Functional Data Analysis and Utilities for Statistical Computing) and it 
is a part of the original one which can be found at http://www-stat.stanford.edu/ElemStatLearn 

The data set is classified fitting a functional generalized kernel additive models (see m) and only 6.4% observations 
of the test sample are misclassified. 

The set of functions A, are the evaluations on the 150 grid of time-points. 

In order to apply the blinding procedure we must fix the values of the parameters involved at each stage in the 
algorithm. The maximum matching error rate is e = 0.10. For the exhaustive step we set di = 1, since A has 150 
functions. For the stochastic step we used IVo = 3 or 6 and N\ = 5 or 10. In addition, for the number of nearest 
neighbors for the nonparametric estimation we considered either r = 5 or 10. We performed 100 replicates for each 
parameter configuration. 

In Table [I] we can observe that the mean number of functions chosen is always between 4.5 and 5, in addition 
we may add that in every case between 3 and 6 functions where selected and that the median number of functions 
is always either four or five. Moreover, in Table [2] we report the mean matching error rate, which is in every case 
between 8.2% and 8.5%. The minimum matching error rate in every case is between 4 or 5 %. The results obtained 
by the blinding procedure using different values for the parameters are practically the same. 



r 

5 


10 

iVi 

N 0 3 

6 

3 

6 

5 

4.93 

4.64 

5.02 

4.67 

10 

4.68 

4.47 

4.79 

4.47 


Table 1: Mean number of functions chosen for each parameter configuration. 



r 

5 


10 

w 

N 0 3 

6 

3 

6 

5 

8.48 

8.31 

8.44 

8.34 

10 

8.32 

8.20 

8.34 

8.36 


Table 2: Mean matching error rate. 


The Canadian weather data 

This data set has been introduced first by Ramsay and Silverman m and it is available in the R package fda 
(see [39]). The data contains daily temperature and rainfall observations over the course of a year measured on 35 
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monitoring stations from Canada. Several authors analyzed this data set in the context of scalar regression (see [20] 
and references therein), their aim is to predict the log annual precipitation from the temperature observations. 

We estimate the functional linear model as proposed by Cardot et al [5], i.e. considering an almost sure consistent 
estimate. This procedure has been numerical implemented by Goldsmith and Scheip [2D], they show that it has good 
empirical performance. 

Our aim is to find the time periods where the temperature has greater influence in the prediction of the log 
annual precipitation amount. Hence, the set of functions A, that we analyze is conformed by 40 local averages of the 
temperatures from non-overlapping intervals of 9 days, fi,..., / 40 , and one local average with the remaining last five 
days, /41. This means that f± is the mean temperature from day 1 to 9, /2 is the mean temperature from day 10 to 
18 and so on and so forth. We apply the blinding procedure to A. The cardinality of A is 41, then it is feasible to 
perform an exhaustive search. The nonparametric estimation is done by r-nearest neighbors, with r = 3 or 5. 

In Table [ 3 ] we exhibit the optimal functions chosen and also the value of the rescaled objective function, h, for 
d = 1, 2,3 or 4. We observe that h decreases on d, there is a bigger gain if we choose two functions instead of one, but 
practically no gain if more functions are chosen. The functions mainly retain information that correspond with two 
larger periods of time, the first one during the summer and the other one during the autumn. In Figure [ 2 ] (a), the 
temperature functions are exhibit, the gray rectangles highlight the areas chosen by the exhaustive procedure. Even 
though, there are not much differences for r = 3 or 5 nearest neighbors the objective function shows better behavior 
for r = 3. 



3 

r 

5 


di 

fi 

h 

fi 

h 

1 

35 

0.0013 

37 

0.0013 

2 

26, 35 

0.0010 

20, 37 

0.00117 

3 

26, 36, 37 

0.00094 

19, 35, 37 

0.00116 

4 

24, 35, 37, 38 

0.00092 

22, 24, 36, 37 

0.00113 


Table 3: Functions chosen by an exhaustive search with their corresponding h values. 

In addition, we also run the stochastic algorithm, first we fix the values of the parameters involved at each stage 
of the algorithm. For the exhaustive step, we set d\ = 1. For the stochastic search, we fix, e = 0.0012, No = 3 or 6 
and Ni = 5 or 10. The number of nearest neighbors remains the same as in the exhaustive search. We perform 100 
replicates for each parameter configuration. For r = 3 in almost every case two functions are chosen (Figure [ 2 ] (b) to 
(e)), while for r = 5, even though the median number of functions is always 2, there is more dispersion (Figure [ 2 ] (f) 
to (i)). 

In Figure[3](a) to (d) we show the histograms for the functions chosen, for r = 3, by the algorithm for the different 
parameter configurations, when two functions are chosen, we indicate with different colors each choice. In every case, 
one of the periods chosen is during the autumn and in most of the cases the other one is during the summer, this 
results are similar as those obtained by the exhaustive search. If more than two functions are chosen, in almost every 
case, two of them lay during the autumn and the remaining during the summer. The results for r = 5 are similar. 

In addition, we change the set of functions considered in this example, we use the local averages during the twelve 
months, A. We run the exhaustive procedure, if d± = 1 the month chosen is November, which includes functions 
/ 35 , / 36 and / 3 7 from A. If d\ = 2 , November is one of the months chosen while the other one is August for r = 3 and 
July for r = 5, these months are contained in the subsets of functions / 21 ,..., /27 from A. In every case, the difference 
between the values of the h with A or with A is smaller than 0 . 0002 . These results are consistent with those obtained 
by Lee et al [2D] and by James et al [25] that also analyzed this problem in the context of sparse regression. 

Finally, we analyze this data set from a different perspective. Our goal is to determine temperature bands that 
have greater impact on the prediction of the logarithm of the annual precipitation. Hence, we consider the set of 
function A = {/ 1 ,... 5 / 12 }, where /,; denotes the occupation measure in a band of 5 Celcius degrees, i.e., fi = |{£ : 
(—35 + 5 (i — 1 )) < X(t) < (—35 + 5?')}|, for i = 1 ,..., 12 . Since the cardinality of A is small we perform an exhaustive 
search, once more the nonparametric estimation has been performed for either r = 3 or r = 5 neighbors. In Table[4]we 
exhibit the optimal values of the rescaled objective function, h. For r = 5 it is clear that the maximum gain is achieve 
if two functions are kept, for r = 3 it is not clear if two or three functions should be retained. In every case, one of 
the functions retained correspond to an intermediate temperature, while the other one corresponds to a an extreme 
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50 100 150 200 250 300 350 


(a) 




Figure 2: (a) Temperature dataset, (b)—(i) The histograms show the number of functions chosen by the algorithm 
for the different parameter choices, (b) r = 3, iVo = 3, Ni = 5 (c) k = 3, N 0 = 3, Ni = 10 (d) r = 3, N 0 = 6, Ni = 5 
(e) r = 3, iV 0 = 6, = 10 (f) r = 5, JV 0 = 3, JVt = 5 (g) r = 5, N 0 = 3, N = 10 (h) r = 5, N 0 = 6, TVi = 5, (i) 

r = 5, N 0 = 6, ^ = 10. 



(a) (b) (c) (d) 

Figure 3: The histograms show the distribution of functions chosen by the algorithm, for r = 3, when two functions 
are chosen for the different parameter configurations, (a) iVo = 3, Ni = 5 (b) Nq = 3, Ni = 10 (c) Nq = 6, N-[ = 5 
(d) N 0 = 6, N = 10 


value that can be either cold or hot, Figure [4] is exhibit as example. 



r 


di 

3 

5 

1 

0.0036 

0.0044 

2 

0.0030 

0.0030 

3 

0.0027 

0.0029 

4 

0.0025 

0.0028 


Table 4: Optimal values of h corresponding to the exhaustive search. 
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Figure 4: Temperature bands chosen for r = 3. 


7 Concluding Remark 

In this paper we address the problem of feature selection when the data is functional, we study several statistical 
procedures including classification, regression and principal components. One advantage of the blinding procedure 
is that it is very flexible since the features are defined by a set of functions, relevant to the problem being studied, 
proposed by the user. Our method is consistent under a set of quite general assumptions, and produces good results 
with the real data examples that we analyze. In any of the statistical procedures studies sparsity has been assumed. 

From the numerical aspects, it is clear that if the cardinality of A is moderate or large an exhaustive search in 
unfeasible, hence an stochastic heuristic algorithm is introduced. The algorithm depends on several parameters, and 
an optimal way to choose them is beyond the scope of this paper. 

A possible shortcut of our approach is that we are assuming that the cardinal p of the set of functions {fi, ■ ■ ■, f P } 
is fixed, although it may be large. An extension to the case when p —► oo - besides a much more involved theory 
- would need in particular, some asymptotic results which ensure that assumption HI holds. Then, the regression 
function should fulfill Besicovitch condition, see for instance Forzani et al M- An important issue is to extend our 
results on feature selection to this framework in the next future. 

Several authors, among them Cuevas 0 and Aneiros et al [2i suggest that the link between variable selection 
methods for high dimensional data and functional data should be deeply studied. In particular they focus on the 
idea considering a discrete representation of the functions and determine whether all the points are relevant for the 
statistical procedure carried out or if it is enough to observe some key points. Ferraty et al m and Kneip and Sarda 
m consider different approaches to tackle this problem for the regression problem. We consider that the question 
the pose is very relevant (and even though in this work we deal with it), it should be deeply studied. More general 
statistical problems should be considered and theoretical results concerning the number of points that should be chosen 
should be studied. 


8 Proofs 

Proof. (Theorem [l]) Since Id is finite it is enough to show that lim^-^ h n (I) = h(I) a.s., for all I £ Id- That is, for 
k = l,...,K, 

1 ” 

^-J2 I { 9 Ax j )=k} I {gn( x J (i))=k}= P (9(X) = k,g{ z {I)) = k) a.s. (19) 

j =i 

We define the unobservable trajectories Z n (I ), where 

Z j {I){t)=E P (X J (t)\f(I,X j )), (20) 

and denote by Q* (/) it’s empirical distribution. 

Equation ( |T9| holds, if for every fixed k. 

1 " 

n Y, I { 9 n(x j )=k} I {g n (z j (i))=k} = P{ 9 {X) = k, g(Z(I)) = k) a.s. (21) 

3 = 1 
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and 


1 

lim -V'/{ g „(x,)=fe} 

n—>• oo ft Zy 1 v J 

i=1 


^{g„(X,-(/))=£} hsniZj (/)) = *} 


= 0 a.s.. 


We derive (21) from the following statements, 


n ^ Q - I {g n (X j )=k}I{g n {Z ] (I))=k} - I{g(X ] )=k}I{g^Z j {I))=k} = 0 a.S. 


3=1 


( 22 ) 


(23) 


and 


= P(g{X) = k,g(Z(I)) = k) a.s. 


n—>■ oo 77, 


(24) 


3=1 


Equation (24) follows from de Law of the Large Numbers. To proof (23) we observe that the left hand side of the 


equation is dominated by 

1 
n 

1 
n 


H l / {ff»(*i)=fc} / {3n(^(i-))=fc} “ I {g{X j )=k}I{g(Z :i (I))=k}\ + 

{x j eCM}n{z(/) j eC(e,r)} 

{X^C(e,r-)}U{Z( 7 )^C(e,r)} 


where C”(e, r) is given in assumption HCl(a). From HCl(a) it can be seen that the first term is clear that it converges 


to zero for every e and r. On the other hand, the last term vanishes due to the Law of Large Numbers. Then (23) 


holds, and the proof of (21) is completed. 


To proof equation (22), it is enough to show that 

tt{.7 : 9n{Xj{I)) = k,g„(Zj(I)) ^ k,g n {Xj) = k}/n -7 0 a.s., 


(25) 


and 


it {j ■ g n (Xj(I)) ^ k 1 g n {Z j {I)) = k,g n (Xj) = k}/n -> 0 a.s. 


(26) 


We define the sets B, C and D as follows: 


B — {w £ : || Zj(I) - -£j(-OIIi,2[o,i] ~^n -¥oo 0 Vj}, 

Cj = {w g ft : d{Xj,dG k ) - d(X j: dG k ) -> 0} and C = ft f= x Cj, 

D = {id £ : d(dG%,dG k ) -^n-s-oo 0}. 

From HCl(b) and HI we have that P(B D C) = 1, and from HC3 and HCl(b) we have that P(D) = 1. Hence, 
P{B C\Cr\D) = l. Then, given <5 > 0 and uj £ B C\ C (1 D, there exists n 0 = n 0 (oj , d) such that maXj—i t ,.. t „\\Zj{I) — 

XjWWhp,!] < V 2 - 

Given id £ B C\ C C\ D, d > 0 and n > no(w, S) we have that: 

{j : g n (Xj(I)) = k, g n (Zj(I)) ± k,g n {Xj) = k} C {j : d{Zj, dG k ) < 2d}. 


This implies that the left hand side of (25) is dominated by ^jj {j : d(Zj,dG k ) < 25} < ^'}2]=i^{d(z j ,aG k )<2S}^ 
which converges a.s. to P(d(Zj,dG k ) < 2d) when n —► oo. 


Finally, from HC2 we get that lims^oP(d{Z, dG k ) < 25) = 0, which concludes the proof of (251. The proof of 


(26) is completely analogue. 
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Proof. (Theorem [2| 

In order to prove our statement it is enough to show that (101 converges to ([8]) a.s. To simplify notation, without 
loss of generality, we consider only one principal component (i.e., 1=1), which will be denoted by a = a\. Then we 
have, 


w-Xjiimdt) , 


and 


n n n ( 

k{i) = -- u{{i)f = -- (c* n ,xAm 2 = - E / 

” 3 = 1 n 3=1 " 3=1 \ Ja 

h(I) = E P ((C/i - U^I)) 2 ) = E P (((a,X) - ( a,Z(I ))) 2 ) = E P ( (^j\(t){X(t) - Z(I)(t))dtj 


We define the unobservable functions Zi(I),..., Z n (I), where 

zAm) = e {x.mfh (Xj),..., f id {x 0 )) , 

and denote by Q* (/) it’s empirical distribution. 


(27) 


h n (I) = + 

n / pb \ 

E (/ (ZiVm - xAm) <%) 


i 

n 


+ 


2 

n 


E(/ q a”W(X J .(t)-^(/)(t))dtj ^ a n (t)[z 3 {I){t)~X 3 m))dty 

First, we show that (28) converges to <§• 

\E (jl an w (*>(*) - wx*))^ ^ E ( f a («"(*) - “(*)) poco - w)(*))+ 
E “(*) pow - ^ww) d^ + 

£ N b (a”(t)-a(t))(X i (t)-Z j (/)(t))dtJ (j'aWiXiM-Ztfmdtj . 


1 

n 


2 

n 


By Cauchy-Schwarz’s inequality we see that the right hand side of (31) converges a.s. to zero, 

1 n ( r b \ 2 i « 

-E / < I|a" - ofe - E ll X i - Z A T )\\% • 

j =1 V- 70 / 2=1 


(28) 

(29) 

(30) 

(31) 

(32) 

(33) 


Dauxois et al. 9], under mild regular conditions, show strong consistency of the eigenvalues and their associated 
eigenvectors (see Propositions 2 and 4 therein). They establish that it is enough to show the convergence of the 
covariance matrix in the operator space norm, assumption HP1 is required. More specifically, they prove that if 
||T„ — T\\ f —► 0 a.s., then \\a% — a \\ L 2 ^ —■► 0 a.s., for all 1 < k < p, where af. (respectively a*,) are the eigenvectors 
of T n , which is the empirical covariance matrix associated with P n (respectively T, which is the covariance matrix 
associated with P). 

Then ||a£ — cz \\ L 2 ^ converges a.s. to zero. By the Law of Large Numbers ^ H-^j — z j(I )|| 2 2 is bounded, 

even more, since {\\Xj — Zj(I)\\, for j = 1,.. . ,n} are iid random variables with finite second moment (assumption 
H2), we have that, 

n 

- E - zmh b] -»• Ep (\\X - Z{I)f) a.s. 

3=1 


17 





We will prove that (32) converges to ([8]). By the Law of Large Numbers, given that 

{/o' a(t)(X 3 (t) — Zj(I)(t))dt, for j = 1,... , n| are iid random variables with finite second moment. Assumption 


H2 and Cauchy-Schwarz’s inequality ensures this statement since 

2 


j\(t)(x 3 (t) - z 3 {i)(t))dt ) < M 2 Ll Jx 3 - z 3 (i)\\l l b] , 


we have that, 


^E {^f a <*(t)(Xj(t) - z AW))dtj -> E ^j\(t)(X(t) - Z(I)(i))dtj j 


a.s. 


By Cauchy- Schwarz’s inequality it can be seen that (33) and (30) converges to zero. An also (29) vanishes by 
Cauchy-Schwarz’s inequality and assumption HI. 


□ 


Proof. Theorem [3] 

We need to proof that 


( [” p n (t)X 3 (t)dt- f p n (t)X 3 (I)(t)dt\ ->E\( f (3 0 (t)X(t)dt- f f3o(t)Z(I)(t)dt) 

j =1 a J a J \ a ^ a J 


a.s. 


We consider the unobservable functions defined in (27) and observe that 

h n (I) = iit ^j\n(t)X 3 (t)dt- j\ n (t)X 3 (I)(t)d?j 

= lit( K jy^(x j (t)-z j (im)d?j + 

n ( \ 2 

Ely put)(z 3 (m-x 3 m))dt J + 


1 

n ■ 

2 
n 


j=i 


E yf a PnWix^-zjiimdtj ^ p n {t) (z 3 (i)( t ) - x 3 (im) 


dt . 


(34) 

(35) 

(36) 


First, we are going to proof that (34) converges to h(I). It is clear that (34), 

2 


l E A*(f) - z 3 (l)(t)) <ttj = i it {fin(t) - fioit)) {X 3 (t) - Z 3 (I)(t)) dtj + (37) 

l E ( J' fioit) ( x j{t) ~ Z 3 (I)(t)) dtj + 


E ( / ifinit) - fioit)) ( Xj(t ) - Z 3 (I)(t)) dtj ^ /3o (t) (X 3 (t) - Z 3 (I)(t )) dtj . 


(38) 


(39) 


We will show that the right hand side of (37) converges to zero. By Cauchy-Schwarz’s inequality, we have that, 

1 n ( r b \ 2 1 n 

- E / W) - fioit)) {Xj(t) Z 3 (I)(t)) dt < || (3 n - ~ E W X i - Z ^)Wl>[aA ■ 

i=i \ Ja J i=l 

Since ||/3 n — /3o||z,2[ a b j converges a.s. to zero by assumption HR1 and by the Law of Large Numbers we have that 
\\Xj — Zj(I)\\ 2 L 2 { a b ] is finite, because |||Xj — Z 3 (I)\\ 2 L2 ^ a b j for j = 1,..., n| are iid random variables with 


1 
n 

finite second moment (assumption H2). 
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We will show that (38) converges to h(I). By Cauchy-Schwarz’s inequality we have, 


A)(t) (X(t) - Z(I)(t)) dt < WMlnaM II* - Z(I)f L2[ab] , 


then, 


E yfj\(t){X(t)-Z(I){t))dt^ j < ||^o|lia [0)6] E (||X - Z{I)W 2 L2[a ^ . 

Given that | J ^ Po(t) (. Xj(t ) — Zj(I)(t)) dt for j = 1, ..., n| are iid random variables with finite second moment (as¬ 
sumption H2), by the Law of Large Numbers we have that, 

^E ^E P Ujy o (t)X(t)dt-j\ 0 (t)Z(I)(t)d?j j =h(I)a.s. 


By Cauchy-Schwarz’s inequality it can be seen that (36), (35) and (39) converge a.s. to zero. 

Proof. (Theorem |4j) We need to proof that h n (I) —¥ h^I) a.s., that means that, 

1 


n 


E 


I 2 


/3 n (s,t)(Xj(s)ds - Xj(I)(s))ds 


i =i 

Observe that 


dt 


-> E 


-i 2 


P(s,t)(X(s) - Z(I)(s))ds 


dt 


MC = - £ 


3=1 


1 2 


1 " 
= l£ 

1 n 

iV 

r) ^ 


Pn(s, t)(Xj(s)ds - Xj(I)(s))ds 

a 

[ p n (s,t)(Xj(s)ds - Zj(I)(s))d. 

J a 


dt 


dt 


3=1 


-i 2 


Pn(s,t)(Zj(I)(s)ds - Xj(I)(s))ds 


dt 


9 " 

-E 

t=i 


Pn{s,t)(Xj(s)ds - Zj(I)(s))ds 


Pn{s, t)(Zj(s)ds - Xj(I)(s))ds 


dt 


First we are going to see that (40) converges to h(I) 

1 2 


i n r d 

X 


3=1' 


Pn(s,t)(Xj(s)ds - Zj(I)(s))ds 


n r d 


dt = -Y 


3= 1' 


/3(s, £)(Xj(s)ds — Zj(/)(s))ds 


1 

--E 

n Z -—^ 


n pd 


3 = 1* 


--E 

T? 2_—/ 




i=i' 


P(s,t){Xj(s)ds — Zj(I)(s))ds 


(Pn(s,t) - P(s,t))(Xj(s)ds - Zj(I)(s))ds 


(Pn{s,t) - P( s d))(Xj(s)ds - Zj(I)(s))ds 


We are going to proof that (43) converges to h(I). Since f d /3(s,t)(Xj(s)ds — Zj(I){s))ds 


□ 


(40) 

(41) 

(42) 


dt (43) 

dt (44) 

dt (45) 


dt has finite first 


moment, by Cauchy-Schwarz’s inequality we have that 


[ [ P{s,t)(Xj(s)ds — Zj(I)(s))ds 

J c J a 


dt < ||Xj - Zj(/)||| 2 [ a)b ]||/3||£2([ 0)b ] X [ C)d ]). 
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Then 


E 


f 


1 2 


P(s,t)(Xj(s)ds — Zj(I)(s))ds 


dt 


^ ll^llL 2 ([a,b]x[c,d]) 


E(||X i -Z,(J)||| 2M] ) ) 


which is finite by assumption H2. 

Since (^ \f*/3(s,t){X j (s)ds-Z j (I)(s))ds 


dt > are iid variables with finite first moment, by Law of the Large 


Numbers, we have that (43) converge a.s. to (16). 


Assumptions H2 and HRF1, and Cauchy-Schwarz’s inequality guarantee that (44) converges a.s. to zero. 


As a consequence of Cauchy-Schwarz’s inequality it can be seen that (451 converges a.s. to zero. 


As a consequence of assumptions H2 and HRfl, and by Cauchy-Schwarz’s inequality it can be seen that (41) 
converges a.s. to zero. Finally it can be seen that that assumptions HI, H2 and HRF1, and Cauchy-Schwarz’s 
inequality guaranty that (42) converges a.s. to zero. 


□ 
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